Aligning Clattses in Parallel Texts
نویسندگان
چکیده
This paper describes a method for the automatic alignment of parallel texts at clause level. The method features statistical techniques coupled with shallow linguistic processing. It presupposes a parallel bilingual corpus and identifies alignments between the clauses of the source and target language sides of the corpus. Parallel texts are first statistically aligned at sentence level and then tagged with their part-of-speech categories. Regular grammars functioning on tags, recognize clauses on both sides of the parallel text. A probabilistic model is applied next, operating on the basis of word occurrence and co-occurrence probabilities and character lengths. Depending on sentence size, possible alignments arc fed into a dynamic progranuning framework or a simulated annealing system in order to find or approxim~te the best alignment. 1he method has been tested on a Small Eng~ lish-Greek corpus consisting of texts relevant to software systems and has produced promising results in terms of correctly identified clause alignments.
منابع مشابه
Aligning Turkish and English Parallel Texts for Statistical Machine Translation
This paper presents a preliminary work on aligning Turkish and English parallel texts towards developing a statistical machine translation system for English and Turkish. To avoid the data sparseness problem and to uncover relations between sublexical components of words such as morphemes, we have converted our parallel texts to a morphemic representation and then used standard word alignment a...
متن کاملSentence Alignment of Brazilian Portuguese and English Parallel Texts
Parallel texts – texts in one language and their translations to other languages – are becoming more and more available nowadays on the Web. Aligning these texts means to find some correspondence between them, in sentence level, for instance. In this paper we describe some experiments done with Brazilian Portuguese and English parallel texts using five well known sentence alignment methods. The...
متن کاملA Program for Aligning Sentences in Bilingual Corpora
Researchers in both machine Iranslation (e.g., Brown et al., 1990) and bilingual lexicography (e.g., Klavans and Tzoukermann, 1990) have recently become interested in studying parallel texts, texts such as the Canadian Hansards (parliamentary proceedings) which are available in multiple languages (French and English). This paper describes a method for aligning sentences in these parallel texts,...
متن کاملChar_align: A Program for Aligning Parallel Texts at the Character Level
There have been a number of recent papers on aligning parallel texts at the sentence level, e.g., Brown et al (1991), Gale and Church (to appear), Isabelle (1992), Kay and Ro . . senschein (to appear), Simard et al (1992), Warwick− Armstrong and Russell (1990). On clean inputs, such as the Canadian Hansards, these methods have been very successful (at least 96% correct by sentence). Unfortunate...
متن کاملAligning Noisy Parallel Corpora Across Language Groups : Word Pair Feature Matching by Dynamic Time Warping
We propose a new algorithm, DK-vec, for aligning pairs of Asian/Indo-European noisy parallel texts without sentence boundaries. The algorithm uses frequency, position and recency information as features for pattern matching. Dynamic Time Warping is used as the matching technique between word pairs. This algorithm produces a small bilingual lexicon which provides anchor points for alignment.
متن کامل